Summary

Protein sequences encoded in three complete bacterial
genomes, those of Haemophilus influenzae, Mycoplasma
genitalium and Synechocystis sp., and the
first available archaeal genome sequence, that of
Methanococcus jannaschii, were analysed using the
blast2 algorithm and methods for amino acid motif
detection. Between 75% and 90% of the predicted proteins
encoded in each of the bacterial genomes and
73% of the M. jannaschii proteins showed significant
sequence similarity to proteins from other species.
The fraction of bacterial and archaeal proteins containing
regions conserved over long phylogenetic
distances is nearly the same and close to 70%. Functions
of 70-85% of the bacterial proteins and about
70% of the archaeal proteins were predicted with varying
precision. This contrasts with the previous report
that more than half of the archaeal proteins have no 
homologues and shows that, with more sensitive 
methods and detailed analysis of conserved motifs, 
archaeal genomes become as amenable to meaning- 
ful interpretation by computer as bacterial genomes. 
The analysis of conserved motifs resulted in the prediction
of a number of previously undetected functions
of bacterial and archaeal proteins and in the
identification of novel protein families. In spite of the
generally high conservation of protein sequences,
orthologues of 25% or less of the M. jannaschii
genes were detected in each individual completely
sequenced genome, supporting the uniqueness of
archaea as a distinct domain of life. About 53% of
the M. jannaschii proteins belong to families of paralogues,
a fraction similar to that in bacteria with larger
genomes, such as Synechocystis sp. and Escherichia
coli, but higher than that in H. influenzae, which has
approximately the same number of genes as M. jannaschii.
Certain groups of proteins, e.g. molecular
chaperones and DNA repair enzymes, thought to be
ubiquitous and represented in the minimal gene set
derived by bacterial genome comparison, are missing
in M. jannaschii, indicating massive non-orthologous
displacement of genes responsible for essential functions.
An unexpectedly large fraction of the M. jannaschii
gene products, 44%, shows significantly
higher similarity to bacterial than to eukaryotic pro- 
teins, compared with 13% that have eukaryotic pro- 
teins as their closest homologues (the rest of the 
proteins show approximately the same level of simi- 
larity to bacterial and eukaryotic homologues or 
have no homologues). Proteins involved in trans- 
lation, transcription, replication and protein secretion 
are most closely related to eukaryotic proteins, 
whereas metabolic enzymes, metabolite uptake systems,
enzymes for cell wall biosynthesis and many
uncharacterized proteins appear to be 'bacterial'. A
similar prevalence of proteins of apparent bacterial 
origin was observed among the currently available 
sequences from the distantly related archaeal genus, 
Sulfolobus. It is likely that the evolution of archaea 
included at least one major merger between ancestral 
cells from the bacterial lineage and the lineage lead- 
ing to the eukaryotic nucleocytoplasm. 

Introduction 

Microbiology has entered a new era that is marked by the 
availability of complete genome sequences of bacteria, 
archaea and unicellular eukaryotes for comparative analysis.
At the time of writing (November, 1996), the completely
sequenced genomes include those of four bacteria,
namely Haemophilus influenzae (Fleischmann et al.,
1995), Mycoplasma genitalium (Fraser et al., 1995),
Mycoplasma pneumoniae (Himmelreich et al., 1996) and
Synechocystis sp. (Kaneko et al., 1996), one archaeon,
Methanococcus jannaschii (Bult et al., 1996) and one
unicellular eukaryote, the yeast Saccharomyces cerevisiae
(Goffeau et al., 1996). Thus, representative genome
sequences from each of the three primary domains of life
(Woese et al., 1990) are already available, creating an
opportunity for the reconstruction of the genome organiza- 
tion of the last common ancestor of all modern life forms. 
In the course of such a reconstruction, traces of major 
evolutionary events that shaped the genomes of bacteria, 
archaea and eukaryotes may be discovered. 

Analysis of complete genomes allows us to address bio- 
logically important problems that were previously not tract- 
able (Fleischmann et al., 1995; Koonin et al., 1996a; 
Koonin and Mushegian, 1996). In particular, only the
comparison of complete gene sets enables the matching of all
known biochemical pathways with specific genes. It is 
becoming possible to ascertain, given sufficient sensitivity 
of the methods used for sequence comparison, that cer- 
tain protein families are not encoded in a given genome 
and to seek alternative candidates for essential roles 
among the gene products. Comparison of complete 
gene sets should also allow researchers to define molecu- 
lar correlates of the unique lifestyles of different species. 

Complete genome sequences will yield biological 
insights only if the functions of the gene products are pre- 
dicted in as much detail as possible. Compared with origi- 
nal, conservative database searches, the use of additional 
computer approaches, specifically careful analysis of rela- 
tively weak sequence similarities by multiple alignment 
construction and motif detection, tends to produce a 
wealth of information on protein functions and relationships
(Bork et al., 1992; 1995; Koonin et al., 1994; Ouzounis
et al., 1995; Koonin et al., 1995, 1996b; Tatusov et al.,
1996). Based on the detailed prediction of protein functions,
it becomes possible to reconstruct, at least in its
general features, the biochemistry of a poorly studied
bacterial species (Fleischmann et al., 1995; Tatusov et al.,
1996). Comparison of the complete sets of proteins
encoded by the genomes of phylogenetically distant spe- 
cies can be used to predict functions and genes that are 
essential for cellular life (Fraser et al., 1995; Mushegian 
and Koonin, 1996a; Koonin and Mushegian, 1996). 

The first sequenced archaeal genome, that of M. jan- 
naschii, appears to be particularly interesting and unusual 
compared with the bacterial genomes. The results of the 
original sequence analysis indicated that 56% of the pro- 
teins encoded in this genome showed no sequence simi- 
larity to any available protein sequences from other 
species (Bult et al., 1996). Clearly, this is caused in part 
by the paucity of sequence information from other archaea 
contained in current databases. With several complete 
genomes now available and genomes of other, diverse
species covered extensively, the great majority of bacterial
and eukaryotic protein families are probably represented
in the current databases. Thus, it is important to
find out whether or not a larger fraction of the M. jannaschii 
protein sequences can be matched with sequences from 
the other two domains through the use of more sensitive 
computer methods. 

Here, we present the results of comparative analysis of 
the protein sequences from M. jannaschii and three com- 
plete bacterial genomes; comparisons with the yeast 
protein sequences were also made where this was consid- 
ered important, although detailed analysis of the yeast 
genome is outside the scope of this work. Our goal was 
to detect sequence similarities, including relatively weak 
but apparently biologically relevant ones, and to use 
them for functional prediction to the maximum extent pos- 
sible, with the aim of revealing common and distinctive fea- 
tures of archaeal and bacterial genomes. 

Results and discussion 

About 70% of the proteins encoded in each bacterial 
or archaeal genome contain highly conserved regions 

It has been observed previously that the great majority of
proteins encoded in a bacterial genome, e.g. those of
Mycoplasma capricolum (Bork et al., 1995), E. coli
(Koonin et al., 1995, 1996b) and H. influenzae (Fleischmann
et al., 1995; Casari et al., 1995; Tatusov et al.,
1996), show sequence similarity to proteins contained in
databases and, more importantly, that for most of the
bacterial proteins, the observed conservation is not limited to
closely related species (Koonin et al., 1995; Tatusov et al.,
1996). Here, we have evaluated the protein sequence
conservation for three complete bacterial genomes and one
archaeal genome in a single computer experiment. The 
complete sets of protein sequences encoded in the gen- 
omes of H. influenzae, M. genitalium, Synechocystis sp. 
and M. jannaschii were compared with the protein sequ- 
ence databases using the wublastp program based on 
the blast2 algorithm, and the search output was further 
analysed for conserved motifs as described under Experi- 
mental procedures. This analysis revealed very similar 
distributions of protein sequence conservation for the bac- 
terial and archaeal genomes. Sequence similarity that we 
interpreted as biologically relevant, based on statistically 
significant alignments and motif conservation, was detec- 
ted for 73% of the M. jannaschii gene products and for 75- 
90% of the gene products in each of the three bacterial 
species (Fig. 1). Only 5% of the M. jannaschii proteins 
are conserved exclusively within archaea, whereas the 
remaining 68% have either bacterial or eukaryotic homo- 
logues, or both (Fig. 1). Thus, the fraction of proteins 
that contain regions conserved over large evolutionary 
distances is nearly constant at about 70% in the three 
completely sequenced bacterial genomes and the first
available archaeal genome (Fig. 1). It is expected that,
with the growth of the archaeal sequence data set, the per- 
centage of the M. jannaschii protein sequences that have 
detectable homologues in other species will reach the 
values observed for bacteria. 
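
The grouping of proteins by the taxonomic range of their homologues (the classes plotted for M. jannaschii in Fig. 1) can be illustrated with a minimal sketch. The function names, the class labels and the toy input below are hypothetical and serve only as an illustration; the input is assumed to be, for each protein, the list of domains of life in which a significant hit from another species was found.

```python
from collections import Counter


def classify_conservation(hit_domains):
    """Assign one protein to a conservation class.

    hit_domains: iterable of 'bacteria', 'eukarya' or 'archaea', one entry
    per significant hit from another species (assumed input format).
    """
    domains = set(hit_domains)
    has_bacterial = "bacteria" in domains
    has_eukaryotic = "eukarya" in domains
    if has_bacterial and has_eukaryotic:
        return "bacterial and eukaryotic"
    if has_bacterial:
        return "bacterial only"
    if has_eukaryotic:
        return "eukaryotic only"
    if "archaea" in domains:
        return "archaeal only"
    return "no similarity"


def conservation_fractions(proteome_hits):
    """Return the fraction of proteins that fall into each class."""
    counts = Counter(classify_conservation(h) for h in proteome_hits.values())
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy input with made-up protein names, for illustration only.
toy_hits = {
    "protA": ["bacteria", "eukarya", "archaea"],
    "protB": ["bacteria"],
    "protC": ["archaea"],
    "protD": [],
}
fractions = conservation_fractions(toy_hits)
```

In this sketch, hits within archaea are ignored as soon as a bacterial or eukaryotic hit exists, which mirrors the way the 'archaea only' class is defined as an exclusive category.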

On average, sequence similarity to the most closely 
related sequence in the database was considerably 
lower for M. jannaschii proteins compared with bacterial 
proteins; for M. jannaschii, the median score produced 
by blastp corresponded to a marginal statistical signifi- 
cance, whereas, for each of the bacteria, this score was 
highly significant (Table 1): hence, the greater role 
accorded to the more sensitive and selective blast2 algo- 
rithm as well as methods for motif analysis in the detection 
of homologous relationships for the archaeal proteins. 
Compared with the original studies, the increase in the
detection of homologues for bacterial proteins was relatively
modest but, in the case of M. jannaschii, the step
up from 44% of gene products with detectable sequence
conservation (Bult et al., 1996) to 73% significantly affects
our view of the genome.
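
The relation between an alignment score and its statistical significance can be sketched with the Karlin-Altschul formula for ungapped local alignments, E = K·m·n·exp(-λS) and P = 1 - exp(-E). The values of λ and K used below are those commonly quoted for ungapped BLOSUM62 alignments, and the database size of 6 × 10^7 residues is an arbitrary assumption; none of these values is taken from the present analysis, and the programs compared in Table 1 may use different scoring systems and statistics.

```python
import math


def blast_p_value(score, query_len, db_len, lam=0.318, k=0.134):
    """Approximate P-value of a raw ungapped alignment score.

    lam and k are assumed Karlin-Altschul parameters (ungapped BLOSUM62);
    db_len is the total number of residues in the database searched.
    """
    expected_hits = k * query_len * db_len * math.exp(-lam * score)
    # P = 1 - exp(-E); expm1 keeps precision when E is very small.
    return -math.expm1(-expected_hits)


# A 300-residue query against an assumed database of 6e7 residues.
p_marginal = blast_p_value(83, 300, 6e7)
p_strong = blast_p_value(148, 300, 6e7)
```

Under these assumptions, a score in the low eighties falls in the marginal range, whereas a score near 150 is many orders of magnitude more significant, which is the qualitative contrast drawn between the median scores for M. jannaschii and for the bacteria.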

The majority of bacterial and archaeal proteins are 
amenable to functional characterization by sequence 
analysis 

Sequence similarity analysis allowed functional prediction 
for the majority of bacterial and archaeal gene products 
(Fig. 2). The fraction of proteins for which only a general
functional prediction could be made, e.g. of a particular 
© 1997 Blackwell Science Ltd, Molecular Microbiology, 25, 619-637 



Fig. 1. Sequence conservation among
bacterial and archaeal protein sequences.
M. jannaschii: 1, similarity to both bacterial and
eukaryotic proteins; 2, similarity to bacterial
proteins only; 3, similarity to eukaryotic
proteins only; 4, similarity only to proteins
from other archaea; 5, no similarity to
proteins from other species. Bacteria: 1,
similarity to both eukaryotic
proteins and proteins from distantly related
bacteria; 2, similarity to proteins from
distantly related bacteria; 3, similarity to
eukaryotic proteins only; 4, similarity to
proteins from closely related bacteria only;
5, no similarity to proteins from other species.
'Closely related bacteria' were defined as
Proteobacteria for H. influenzae, low G + C
Gram-positive bacteria for M. genitalium and
Cyanobacteria for Synechocystis sp.



enzymatic or binding activity, but for which attribution of 
a specific physiological role was not feasible, was signifi- 
cantly greater for M. jannaschii than for H. influenzae 
and M. genitalium, but very similar to that for Synechocystis
sp. (Fig. 2). Presumably, this reflects the limitations of
the current knowledge of archaeal and cyanobacterial
biochemistry. On many occasions, a general indication of
protein function is possible even in the absence of detectable
sequence similarity, particularly via the identification
of signal peptides and transmembrane domains. Only a 
small fraction of predicted gene products in each of the 
complete genomes remains totally uncharacterized after 
detailed sequence analysis (Fig. 2). 
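
As an illustration of how a structural feature such as a candidate transmembrane segment may be recognized without any database hit, the following minimal sketch scans a sequence with a sliding-window average of Kyte-Doolittle hydropathy values. This is a generic textbook approach and not necessarily the method applied in this work; the window length of 19 residues and the threshold of 1.6 are conventional choices assumed here, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_segments(seq, window=19, threshold=1.6):
    """Return merged (start, end) spans, 0-based and end-exclusive, of all
    windows whose mean hydropathy reaches the threshold."""
    seq = seq.upper()
    segments = []
    for i in range(len(seq) - window + 1):
        # Unknown residue codes contribute a neutral value of 0.
        mean = sum(KD.get(aa, 0.0) for aa in seq[i:i + window]) / window
        if mean >= threshold:
            if segments and i <= segments[-1][1]:
                # Overlapping or adjacent window: extend the last span.
                segments[-1] = (segments[-1][0], i + window)
            else:
                segments.append((i, i + window))
    return segments


# Made-up sequences: a leucine-rich stretch and a purely polar repeat.
membrane_like = "MKT" + "L" * 20 + "DEKR" * 5
soluble_like = "DEKRNQ" * 10
```

A protein with one or more such spans would receive only a general prediction (for example, 'membrane protein'), in line with category 3 of Fig. 2.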

A comparison of the distribution of proteins with predic- 
ted function by functional classes reveals both common 



Table 1. Best database search scores for archaeal and bacterial
protein sequences.

                      Average/median highest score
Species               BLASTP         WUBLASTP (BLAST2)

M. jannaschii         174/83(a)      309/148(b)
H. influenzae         566/415        790/611
M. genitalium         312/208        505/317
Synechocystis sp.     279/134        411/216

a. For an average-sized protein (about 300 amino acid residues), a
score of 83 approximately corresponds to a P-value of 0.01 and is
not necessarily indicative of biological relevance of the alignment.

b. For an average-sized protein (about 300 amino acid residues), a
score of 148 approximately corresponds to a P-value <10^-7, which
typically indicates a biologically relevant alignment (except for
proteins with a highly biased composition).
(Table 2). The latter observation, however, may largely be
accounted for by the current insufficient knowledge of 
archaeal biochemistry. This situation is likely to change 
with further progress in the study of archaeal genomes 
and metabolism, perhaps resulting in a distribution of pro- 
teins by functions more closely resembling that in bacteria. 

Novel findings in the M. genitalium genome: automatic
prediction of protein functions in complete genomes 
remains elusive 

An automatic computer system called GeneQuiz (Scharf et
al., 1994) has recently been used for rapid reanalysis of the
M. genitalium genome sequence (Ouzounis et al., 1996). It
has been indicated that the GeneQuiz results were corroborated
by detailed manual analysis and could be considered
an important benchmark for assessing the completeness
and accuracy of genome analysis (Ouzounis et al.,
1996). If shown to be robust, this system would indeed
represent major progress in comparative genomics. Since
an analysis of the M. genitalium protein sequences was 
an integral part of our project on bacterial and archaeal 
genome analysis, we compared our results with those of 
the GeneQuiz. 

Our analysis of the M. genitalium gene products resul- 
ted in 53 sequence similarity-based functional predictions 
that have not been reported originally
(http://www.ncbi.nlm.nih.gov/Complete_Genomes/Mgen/Novel_Findings).
All of these functional assignments were supported by 
statistically significant sequence similarity and/or motif 
conservation. In most cases, sequence similarity to func- 
tionally uncharacterized proteins in the databases has 

Table 2. Functional classification of proteins encoded in complete archaeal and bacterial genomes.

                                             Number of proteins
Functional category(a)                       M. jannaschii    H. influenzae    M. genitalium    Synechocystis sp.

Amino acid metabolism and transport          102 (5.9%)       162 (9.5%)       17 (3.8%)        158 (5.0%)
Replication, recombination and repair        87 (5.0%)        110 (6.4%)       37 (7.9%)        82 (2.6%)
Transcription                                22 (1.3%)        30 (1.8%)        12 (2.8%)        26 (0.8%)
Energy conversion                            162 (9.4%)       141 (8.3%)       37 (7.9%)        193 (6.1%)
mRNA translation and ribosome biogenesis     114 (6.6%)       125 (7.3%)       97 (20.7%)       118 (3.7%)
Outer membrane and cell wall                 36 (2.1%)        105 (6.2%)       9 (1.9%)         131 (4.1%)
Carbohydrate metabolism and transport        10 (0.6%)        80 (4.8%)        14 (3%)          110 (3.5%)
Nucleotide metabolism and transport          33 (1.9%)        73 (4.3%)        28 (6%)          61 (1.9%)
Cofactor metabolism                          69 (5.1%)        70 (4.0%)        8 (1.7%)         107 (3.4%)
Chaperones                                   12 (0.7%)        53 (3.1%)        15 (3.3%)        68 (2.1%)
Inorganic ion transport                      48 (2.8%)        52 (3.0%)        10 (2.1%)        115 (3.6%)
Lipid metabolism                             13 (0.7%)        40 (2.3%)        9 (1.9%)         62 (2.0%)
Secretion                                    23 (1.3%)        35 (2.1%)        20 (4.3%)        45 (1.4%)
General prediction only                      494 (28.5%)      360 (19.4%)      95 (20.1%)       974 (30.7%)
Total predicted function                     1235 (71.3%)     1457 (85.6%)     408 (87.2%)      2254 (71.1%)

a. As previously (Tatusov et al., 1996; Koonin and Mushegian, 1996), genes coding for proteins
implicated in the membrane transport of a particular class of metabolites (e.g. nucleotide
components or amino acids) and in the expression regulation of a given functional class of
genes were included in the respective class.
Fig. 2. Functional prediction for bacterial and archaeal proteins.
1, specific prediction including assignment to a functional class;
2, general functional prediction (e.g. enzymatic or binding activity)
based on sequence similarity; 3, general functional prediction
based on the analysis of structural features (e.g. signal peptides or
transmembrane domains); 4, no functional prediction in spite of
detected sequence conservation; 5, n



and distinctive features between M. jannaschii and bacteria 
(Table 2). The fraction of proteins in several categories, 
namely translation, transcription, replication, energy con- 
version, cofactor metabolism and inorganic ion transport, 
is similar for bacteria and M. jannaschii, whereas other 
categories seem to be underrepresented in M. jannaschii 



been detected previously, but functional prediction became 
possible only upon more detailed motif analysis. 

Of the six GeneQuiz predictions considered most
interesting and specifically discussed by Ouzounis et al.
(1996), only two (MG333 and MG385, both phosphodiesterases)
could be confirmed. The remaining four
cases illustrate problems with the GeneQuiz analysis.
The MG123 gene product, identified as an arginine deiminase
by Ouzounis et al. (1996), cannot possess this
enzymatic activity for several reasons. The observed similarity
to arginine deiminase from Mycoplasma arginini is
not highly significant statistically (P-value of 0.04) and is
much lower than that between arginine deiminases from
other Mycoplasma species and distantly related
(Gram-negative) bacteria. Furthermore, the region of similarity
does not include the sequence motifs conserved in the
arginine deiminases and does not occupy a similar location
in MG123 and arginine deiminases. Most importantly,
MG123 is predicted to consist mostly of non-globular
domains and is therefore unlikely to possess any enzymatic
activity. Thus, MG123 is not the second enzyme of
amino acid metabolism in M. genitalium. The first enzyme
in this category, serine hydroxymethyltransferase (MG394),
is likely to be involved only in folate metabolism (Mushegian
and Koonin, 1996a); amino acid metabolism may not be
represented in M. genitalium at all.

The similarity between MG237 and isoleucyl-tRNA
synthetases could not be confirmed in our analysis, and
the origin of this observation remains uncertain to us.
The case of MG449 and phenylalanyl-tRNA synthetase
shows that a reported similarity can be valid while
its implications are not straightforward. The protein in
question is indeed highly similar to the N-terminal region
of Phe-tRNA synthetase from M. genitalium and other
bacteria. Inspection of the three-dimensional structure of the
Thermus thermophilus Phe-RS (Mosyak et al., 1995) indi- 
cates, however, that this region folds into a distinct domain, 
which shows sequence similarity to a variety of small pro- 
teins and domains of larger proteins including, among 
others, bacterial Met-tRNA synthetases, yeast quadruplex 
DNA-binding protein and human cytokine EMAP (Frantz 
and Gilbert, 1995; Koonin et al., 1996b; Simos et al.,
1996; E. V. Koonin and A. G. Murzin, unpublished obser- 
vations). The common function of all these domains 
remains unclear but obviously has little relation to the 
enzymatic activity of Phe-RS. 

In the last of the six examples, Ouzounis et al. (1996) 
characterize the MG468 gene product as 'DNA polymer- 
ase I' noting that the sequence is identical to that of 
MG262, and the relationship between the two is unclear. 
In fact, the available genome sequence of M. genitalium 
encodes only one of these proteins, namely MG262. The 
other one seems to be a fictitious 'duplication' introduced 
in the course of the original annotation. Furthermore, 
MG262 is not a DNA polymerase; it is a homologue of the
N-terminal, 5'-3' exonuclease domain of DNA polymerase
and can be confidently predicted to possess exonuclease
activity (Koonin and Bork, 1996; Himmelreich et al., 1996).

Altogether, of the 21 predictions of novel protein functions 
made by GeneQuiz, only eight could be fully corroborated 
(E. V. Koonin, unpublished observations). In contrast, 
most of the predictions produced by our present analysis 
have not been reported by GeneQuiz. It appears, therefore, 
that currently available automatic systems for genome 
analysis are still far from perfect, and expert evaluation 
of functional predictions remains the limiting step. 

Novel findings in the Methanococcus jannaschii 
genome: filling in major gaps in cell physiology and 
predicting enzymatic activities with as yet unknown 
cellular role 

The analysis of the protein sequences encoded in the M.
jannaschii genome resulted in functional prediction for
382 gene products based on previously undetected sequence
conservation; a selection of such findings is shown in
Table 3. Some of the new functional predictions fill important
gaps in the original identifications. For example, it has
been indicated that M. jannaschii encodes aminoacyl-tRNA
synthetases (aaRSases) for only 16 amino acids,
with no enzymes identified for glutamine, asparagine,
cysteine and lysine (Bult et al., 1996). Asparaginyl-tRNA
and glutaminyl-tRNA are thought to be formed via
transamidation of the aspartyl-tRNA and glutamyl-tRNA,
respectively, a mechanism previously identified in
Gram-positive bacteria (Strauch et al., 1988) and in archaea
(Curnow et al., 1996). The mechanism of cysteine and
lysine activation for incorporation into protein has remained
unknown. Our analysis revealed a protein, MJ0539, that,
while only distantly related to cysteine aaRSases, contains
the principal conserved motifs typical of the aaRSase
class I (Eriani et al., 1990). We predict that this protein
is responsible for cysteine activation in M. jannaschii.
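
Motif-based recognition of this kind can be sketched as a search for the two class I signatures, 'HIGH' followed downstream by 'KMSKS', with some positions allowed to vary. The regular expressions below are deliberately loose, hypothetical patterns chosen for illustration; they are not the profiles used in this analysis, and the test sequence is invented.

```python
import re

# Illustrative, relaxed versions of the class I aaRSase signatures.
HIGH_LIKE = re.compile(r"H.GH")
KMSKS_LIKE = re.compile(r"[KR]MS[KR][ST]")


def class_one_signature(seq):
    """Return (high_start, kmsks_start) for the first HIGH-like motif that
    is followed downstream by a KMSKS-like motif, or None if there is none.
    Positions are 0-based."""
    seq = seq.upper()
    for high in HIGH_LIKE.finditer(seq):
        kmsks = KMSKS_LIKE.search(seq, high.end())
        if kmsks:
            return high.start(), kmsks.start()
    return None


# Invented sequence carrying both motifs in the expected order.
toy_protein = "MAAPHIGHAYTT" + "G" * 30 + "LKMSKSLGN"
hit = class_one_signature(toy_protein)
```

Requiring both motifs in the correct order reduces chance matches to the short patterns; in practice, the spacing between the motifs and the surrounding alignment would also need to be examined before a functional prediction is made.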

Furthermore, a duplication of the alpha-chain of the 
phenylalanine aaRSase was detected (Table 3). One of 
the two diverged copies, namely MJ0487, shows signifi- 
cant similarity to lysine aaRSases from several bacterial 
species and may be predicted to catalyse Lys-tRNA for- 
mation, thus completing the repertoire of aaRSases for 
M. jannaschii. 

Most of the novel functional predictions stem from 
the identification of relatively weak sequence similarities 
combined with the detection of known or novel conser- 
ved motifs. Certain classes of proteins, e.g. ATPases, 
proteins containing the helix-turn-helix DNA-binding 
domain and SAM-dependent methyltransferases (see
Table 5), which are best recognized by motif analysis, 
were significantly underpredicted in the original report on 
Table 3. Novel functional predictions for M. jannaschii gene products (examples).

MJ no.  Paralogues
Best functionally relevant hit from other species;
P-value; % identity/alignment length; conserved motifs
Predicted function/activity and

MJ0050 None
MJ0052 None

MJ0165 MJ0616

gi|1230658 (S. cerevisiae); 2.0e-19; 22%/395
gi|1221579 (H. influenzae); 0.047; 27%/95;
a catalytic motif conserved in sulphurtransferases
and phosphatases
(E. V. Koonin, unpublished observations)
MYOP_BOVIN; 8.0e-19; 29%/244
SOHB_HAEIN; 0.0013; 25%/178; conserved
motif around the predicted catalytic serine
(E. V. Koonin, unpublished observations)
PUR6_METTH; 2.1e-06; 31%/128

MJECL20 (nearly identical gene on an
extrachromosomal element)
MJ0398, MJ1098

MJ0357 None

None

Eight class I aaRSases

PIR/JC1383 (Pyrobaculum organotrophum);
2.0e-11; intron and intein endonuclease signature

SECB_ECOLI; 0.036; motif o
proteins from different bacteria
S61G_YEAST; 1.4e-05
gi|496733 (Sulfolobus solfataricus); 7.3e-09
gi|496154 (Mycoplasma pulmonis); 0.027;
modified 'HIGH' and 'KMSKS' motifs

Glutamate or histidine decarboxylase

Sulphurtransferase (rhodanese homologue)
probably involved in cysteine biosynthesis
(Table 9)

Phosphoribosylaminoimidazole (AIR)
carboxylase; compared with homologues
from other species and with MJ0616, contains
an additional, uncharacterized N-terminal
domain

Phosphotyrosyl phosphatase

Endonuclease also containing a predicted
DNA-binding HTH domain; MJ0314, MJ0398
and MJ1098 are newly detected, 'stand-alone'
genes encoding putative endonucleases that
are homologous to the 16 inteins contained in
M. jannaschii genes (Bult et al., 1996)
Component of protein secretion machinery

Signal recognition particle subunit SEC61
Translation elongation factor 1b; distantly related
to eukaryotic translation elongation factors
Cysteinyl-tRNA synthetase; this putative aaRS is
unusual in that it shows only limited similarity to
other class I aaRS and is somewhat more similar
to methionyl-tRNA synthetases and glutamyl-tRNA
synthetases; however, analysis of the
entire set of M. jannaschii aaRS suggests
specificity for Cys

the M. jannaschii genome (Bult et al., 1996). Examples of
conserved motifs in two families of M. jannaschii proteins, 
for which no function has been reported previously, are 
shown in Fig. 3. 

The first family includes 13 proteins predicted to possess
nucleotidyltransferase (NTase) activity (Fig. 3A). Only
some of them showed moderate similarity to bacterial
aminoglycoside-NTases (e.g. kanamycin adenylyltransferase),
but all the sequences contain a prominent conserved
motif, which is a signature of a large superfamily
of nucleotidyltransferases including, in addition to the
aminoglycoside-NTases, poly(A) polymerases, terminal
nucleotidyltransferases and eukaryotic DNA polymerase
beta (Holm and Sander, 1995; Yue et al., 1996; Martin
and Keller, 1996). A unique feature of the putative M.
jannaschii NTases is their small size (with the exception
of two larger proteins, MJ1086 and MJ0694), close to
that of the N-terminal domain of the kanamycin NTase
whose tertiary structure has been resolved (Sakon et al.,
1993). Three similar small putative NTases with a high
similarity to the M. jannaschii NTases were identified in
Synechocystis sp., and one putative NTase with a moderate
similarity to the M. jannaschii proteins was found in H.
influenzae (Fig. 3A); no functional predictions were
previously available for any of these proteins.

The conserved motif is unique for NTases, thus allowing 
a confident prediction of this enzymatic activity for each of 
the M. jannaschii proteins belonging to the family. Beyond 
that, however, the sequence similarity between the archaeal 
proteins and nucleotidyltransferases with known specificity
is insufficient to predict their actual function or the
substrate(s). Several putative NTases in M. jannaschii are
closely related to each other, suggesting a series of rela- 
tively recent duplications, which, in five cases, appeared 
also to have included an adjacent gene; the two genes 
may together comprise a novel type of mobile element 
(see legend to Fig. 3A). 

The second novel protein superfamily, which includes 
Table 3. Continued

MJ0703
MJ0817
MJ0839
MJ0902
MJ0973

MJ1318
MJ1437
MJ1440
MJ1646
MJ1660

MJ0411, MJ1532
None
None
None
MJ0066

MJ1030, MJ1179,
MJ0541, MJ1062,
MJ1255, MJ0951
None
MJ1624

MJ1207 MJ1530

MJ0544, MJ1057
MJ0008, MJ0133,
MJ1188, MJ1614
MJ1417

MJ1594
MJ1087, MJ1104,
MJ1427, MJ0969
MJ1109, MJ1366,
MJ1655, MJ0204
MJ0487 and seven

Best functionally relevant hit from other species;
P-value; % identity/alignment length; conserved
motifs

Predicted function/activity a

HIS4_STRCO; 2.6e-17
gi|1143229 (Neisseria gonorrhoeae); 2.0e-15;
28%/190
PRI1_MOUSE; 1.9e-08; 27%/222
LEP3_ECOLI; 0.044; 25%/90
CYSH_SALTY; 9.7e-12; 30%/166;
pyrophosphatase-specific P-loop (PP-motif) and
additional motif conserved in PAPS reductases
and ATP sulphurylases
YGR277c (S. cerevisiae); 2.3e-14; 34%/142;
nucleotide-binding motifs distantly related to the
'HIGH' and 'KMSKS' motifs in class I aaRSases
SC65_YEAST; 2.1e-07; 35%/85
PIR/JC2485 (Lactococcus lactis); 0.00066; 26%/
218; bacterial DNA primase motifs

PAIA_BACSU; 1.4e-08; 35%/87;
acetyltransferase motifs A and B
ALG5_YEAST; 9.9e-14; 29%/255
YXDH_BACSU; 5.0e-10; 35%/262
gi|1142617 (B. subtilis); 3.8e-07; 32%/101; serine
protease catalytic motifs
YJJG_ECOLI; 2.6e-13; 25%/218; dehalogenase-related
hydrolase motifs
KHSE_BACSU; 8.0e-07; 27%/274; kinase motifs
PYRS_HUMAN; 5.3e-09; 31%/132;
phosphoribosyltransferase motifs
SYFB_YEAST; 3.6e-15; 33%/165;

Phosphoribosylformimino-5-aminoimidazole
carboxamide ribotide (PAICAR) isomerase
Phosphatidylserine decarboxylase
DNA primase subunit
Leader peptidase
3'-phosphoadenosine-5'-phosphosulphate (PAPS)

Nucleotidyl(cytidylyl?)

Signal recognition particle subunit SEC65
DNA primase (bacterial DnaG homologue);
MJ1624 is a small protein with more distant
similarity to DnaG that, however, retains the
motifs typical of bacterial primases
Acetyltransferase

Periplasmic serine protease; homologue of the
protease domain of bacterial Lon protease
without the ATPase domain
Hydrolase

Orotate phosphoribosyltransferase

Phenylalanyl-tRNA synthetase; this gene is
clearly a duplication of MJ0487; both MJ0487 and
MJ1660 are most similar to Phe-aaRS from other
species; however, MJ0487 also shows significant
similarity to lysyl-aaRS and is likely to be specific
for lysine



12 proteins from M. jannaschii, is brought together by a 
motif with two conserved histidines and a conserved 
aspartic acid, which appear to be diagnostic of metal- 
dependent hydrolase activity (Fig. 3B). Unlike the nucleo- 
tidyltransferase family, this diverse group of proteins is 
ubiquitous in bacterial, archaeal and eukaryotic genomes 
but, as the functional significance of the conserved motif 
has not been recognized so far, most of them have been 
described only as 'hypothetical proteins'. However, two 
of the Synechocystis sp. proteins belonging to this family 
show a highly significant similarity to eukaryotic glyoxa- 
lases, which are well-characterized Zn enzymes (Manner- 
vik and Ridderstrom, 1993). It is of particular interest that 
three of the proteins in this superfamily (MJ1236, MJ0162 
and MJ0047) are highly similar to subunits of the eukaryo- 
tic cleavage and polyadenylation specificity factor (CPSF) 
(Chanfreau et al., 1996; Jenny et al., 1996; Stumpf and
Domdey, 1996). It remains to be determined whether or
not these proteins are involved in mRNA processing in
M. jannaschii but, regardless of this, the conserved motif
defines the likely catalytic centre of the CPSF.

© 1997 Blackwell Science Ltd, Molecular Microbiology, 25, 619-637

As shown previously for H. influenzae, many of the bio- 
chemical pathways in a poorly characterized bacterium 
may be reconstructed on the basis of a comparison with 
a well-characterized, related bacterium, in that case E.
coli (Fleischmann et al., 1995; Tatusov et al., 1996).
This task is more complicated for an archaeon as there 
is no representative that has been studied as thoroughly 
as bacterial models. An initial reconstruction of the basic 
metabolic pathways for M. jannaschii has been performed
using the WIT system (http://www.cme.msu.edu/WIT/, 
and R. Overbeek, personal communication). 

Here, we did not aim at complete reconstruction of the 
M. jannaschii biochemistry. Nevertheless, the detailed 
analysis of protein sequences and, in particular, the unex- 
pectedly high degree of sequence conservation between 
many archaeal and bacterial metabolic enzymes make it
possible to delineate the enzymatic complement for a num-
ber of biochemical processes in M. jannaschii, in several
cases significantly extending the initial reconstructions
(Table 4). 



[Fig. 3A. Alignment of the two conserved sequence blocks of the putative 'MJ-type' nucleotidyltransferases, with the consensus line. The aligned sequences include M. jannaschii MJ0128, MJ1379, MJ1215, MJ0141, MJ0604, MJ1547, MJ1305, MJ1112 and MJ0694; H. influenzae HI0073; Synechocystis sp. slr1241, sll1504 and ssl2749; KANU_STAAU; and 25A6_HUMAN. Alignment not reproduced.]


[Fig. 3B. Alignment of the conserved motif of the putative Zn-dependent hydrolases. Aligned sequences: YSH1, CPSF-73k and CPSF-100k; M. jannaschii MJ1236, MJ0162, MJ0047, MJ0534, MJ0732, MJ0748, MJ0861, MJ1163, MJ0296, MJ0888, MJ0301, MJ0448, MJ1502 and MJ1374; H. influenzae HI1274, HI1663 and HI0061; M. genitalium MG139; Synechocystis sp. sll0217, sll0550, sll0647, slr0050, slr0551, sll1019, sll0514, slr1259 and sll1036. In the figure, MJ0534, MJ0732, MJ0748, sll0550 and sll0647 are labelled as flavoproteins, and sll1019 and sll0514 as glyoxalases. Alignment not reproduced.]



Limited gene orthology and non-orthologous
displacement of numerous genes between archaea 
and bacteria 

An important measure of closeness between genomes of 
any two species is the fraction of the genes in each of the 
genomes that show similarity to genes from the other 
genome and, more specifically, how many of these genes 
are orthologues. Orthologues are genes related by vertical 
descent from a common ancestor and responsible for the 
same function in different species, in contrast to para-
logues, which are homologues related by duplication and
having similar but not identical functions (Fitch, 1970). 
We have observed previously that the great majority of 
H. influenzae genes have orthologues in E. coli, making
the smaller gene complement of the former almost a sub-
set of the larger gene complement of the latter (Tatusov et
al., 1996). Furthermore, even in the case of the phylo-
genetically distant M. genitalium, about one-half of the
genes had orthologues in both H. influenzae and E. coli
(Mushegian and Koonin, 1996a; Koonin and Mushegian, 
1996). We delineated the sets of likely orthologues of M. 
jannaschii genes in the three bacteria with completely 
sequenced genomes and in yeast (Table 5). Predictably, 
the fraction of M. jannaschii genes that have bacterial 
orthologues is considerably lower than observed even
between phylogenetically distant bacteria, with the great- 
est number of orthologues detected in Synechocystis sp., 
a free-living, autotrophic bacterium with a relatively large 
genome. The fact that, in a comparison of the M. jannaschii 
protein sequences with those encoded in individual, com- 
plete bacterial and eukaryotic genomes, orthologues were 
found for 25% at most, testifies to the uniqueness of the M. 



Fig. 3. Previously uncharacterized families of M. jannaschii
proteins and their conservation in bacteria.

A. Putative nucleotidyltransferases. The alignment, generated using
the macaw program, includes all the sequences of 'MJ-type'
nucleotidyltransferases encoded in the completely sequenced
bacterial and archaeal genomes as well as the kanamycin
nucleotidyltransferase from Staphylococcus aureus
(KANU_STAAU) and human 2'-5' oligoadenylate synthetase
(25A6_HUMAN). The position of the first residue of each aligned
segment in the respective protein sequence is indicated by a
number. The consensus includes amino acid residues conserved in
all sequences (upper case) and those conserved in the majority of
the sequences (lower case). U indicates a bulky hydrophobic
residue (I, L, V, M, F, Y, W); O indicates a small residue (G, A, S);
+ indicates a positively charged residue (K, R); and - indicates a
negatively charged residue (D, E). Asterisks indicate residues
shown to interact with ATP in the kanamycin
nucleotidyltransferases (Pedersen et al., 1995).

B. Putative Zn-dependent hydrolases. The alignment of the
conserved motif was constructed using the cap and most programs
and includes all the proteins encoded in the completely sequenced
bacterial and archaeal genomes that belong to the superfamily.
In addition, the sequences of the CPSF subunits from yeast
(YSH1) and from humans (CPSF-73k and CPSF-100k) are shown.
The designations are as in (A).
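
The consensus notation defined in the legend can be illustrated with a minimal sketch. The residue classes (U, O, +, -) are taken from the legend; the decision rule used here (identical in all sequences, then strict majority, then a class shared by all sequences) and the toy alignment are assumptions made for illustration and are not necessarily the procedure used to draw Fig. 3.

```python
# Residue classes as defined in the legend of Fig. 3.
CLASSES = [
    ("U", set("ILVMFYW")),  # bulky hydrophobic
    ("O", set("GAS")),      # small
    ("+", set("KR")),       # positively charged
    ("-", set("DE")),       # negatively charged
]


def column_symbol(column):
    """Return a consensus symbol for one alignment column ('.' if none)."""
    residues = [r.upper() for r in column if r.isalpha()]
    if len(residues) < len(column):
        return "."  # assumption: a gap in the column means no consensus
    most_common = max(set(residues), key=residues.count)
    count = residues.count(most_common)
    if count == len(residues):
        return most_common           # conserved in all sequences
    if count > len(residues) / 2:
        return most_common.lower()   # conserved in the majority
    # Assumption: otherwise report a class only if all residues belong to it.
    for symbol, members in CLASSES:
        if all(r in members for r in residues):
            return symbol
    return "."


def consensus(alignment):
    """Consensus string for a list of equal-length aligned sequences."""
    return "".join(column_symbol(col) for col in zip(*alignment))


# Hypothetical toy alignment (not taken from Fig. 3).
toy = ["GSDIDK", "GSDVEK", "GADLDK", "GADMER"]
print(consensus(toy))  # GODU-k
```

Note that in this notation the symbol for a negatively charged position coincides with the usual gap character, so gapped columns are reported here as '.' to keep the two apart.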

Table 4. Reconstruction of selected metabolic pathways in Methanococcus jannaschii.

Pathways, with the bacterial genes and their functional equivalents in M. jannaschii (a)

Glycolysis (the downstream portion):
Triosephosphate isomerase tpiA (MJ1528), glyceraldehyde 3-phosphate dehydrogenase gap (MJ1146),
3-phosphoglycerate kinase pgk (MJ0641), phosphoglyceromutase yibO (MJ1612), enolase eno (MJ0232),
pyruvate kinase pyk (MJ0108)

TCA cycle derivative:
Malate dehydrogenase mdh (MJ1425), fumarase fumC (MJ1294), fumarate reductase flavoprotein frdA
(MJ0033) and iron-sulphur protein frdB (MJ0092) subunits, succinyl-CoA synthetase alpha sucD (MJ0210) and
beta sucC (MJ1246) subunits

[...]:
Acetoacetyl-CoA thiolase ERG10 (MJ1549), 3-hydroxy-3-methylglutaryl-CoA synthase HMGS (MJ1546),
3-hydroxy-3-methylglutaryl-CoA reductase HMG1 (MJ0705), mevalonate kinase ERG12 (MJ1087),
phosphomevalonate kinase ERG8 (MJ1427 and/or MJ0969), diphosphomevalonate decarboxylase ERG19
(MJ0102), isopentenyl-diphosphate delta-isomerase (?), geranyl diphosphate synthase ERG20 (MJ0860)

[...]:
Aspartate oxidase (quinolinate synthetase B) nadB (MJ0033), quinolinate synthetase A nadA (MJ0407),
quinolinate phosphoribosyltransferase nadC (MJ0493), nicotinate-nucleotide adenylyltransferase nadD (?),
deamido-NAD:ammonia ligase (NAD synthetase) nadE (MJ1352)

[...]:
Rhodanese-like sulphurtransferase (MJ0OS2), serine-pyruvate aminotransferase (MJ0959)

[...]:
Glutamyl-tRNA reductase hemA (MJ0143), glutamate 1-semialdehyde aminotransferase hemL (MJ0603),
5-aminolaevulinate dehydratase hemB (MJ0643), porphobilinogen deaminase hemC (MJ0569),
uroporphyrinogen III synthetase hemD (MJ0994)

Cobalamin biosynthesis (d):
Uroporphyrinogen III methylase cysG/cobA (MJ0965), precorrin-2 methylase cbiL/cobI (MJ0771), -/cobG (?),
precorrin-3B methylase cbiH/cobJ (MJ0813), precorrin-4 methylase cbiF/cobM (MJ1578), precorrin-6A
reductase cbiJ/cobK (MJ0552), precorrin-6B methylase cbiE/cobL (MJ1522), precorrin-6B decarboxylase
cbiT/cobL (MJ0391), precorrin-8x isomerase cbiC/cobH (MJ0930), cobyrinic acid a,c-diamide synthase cbiA/
cobB (MJ1421), cobalt insertion protein -/cobN (MJ0908), cob(I)alamin adenosyltransferase cobA/cobO
(MJ1157), cobyric acid synthase cbiP/cobQ (MJ0484), cobinamide synthase cbiB/cobD (MJ1314), cobinamide
kinase/cobinamide phosphate guanylyltransferase cobU/cobP (?), cobalamin synthase cobS/cobV (MJ1438),
nicotinate-nucleotide:dimethylbenzimidazole phosphoribosyltransferase cobT/cobU (MJ1598)

Biotin biosynthesis:
Pimeloyl-CoA synthetase bioW (MJ12S7), 7-keto-8-aminopelargonate synthetase bioF (MJ1298),
7,8-diaminopelargonate aminotransferase bioA (MJ1300), dethiobiotin synthetase bioD (MJ1299), biotin
synthetase bioB (MJ1296), biotin-[acetyl-CoA carboxylase] holoenzyme synthetase birA (MJ1619, no
transcription regulation domain)

Riboflavin biosynthesis:
ribD pyrimidine deaminase domain (MJ0430 or MJ1102), ribD pyrimidine reductase domain (MJ0671),
3,4-dihydroxy-2-butanone-4-phosphate synthetase ribB (MJ0055), 6,7-dimethyl-8-ribityllumazine synthetase ribE
(MJ0303), riboflavin synthase ribF (MJ1184 ?), FAD synthetase FAD1 (MJ0066 and/or MJ0973)



a. Unless otherwise noted, the pathways are modelled after E. coli (Neidhardt et al., 1996), and genes are named after the E. coli orthologues.
The genes are listed in the order in which the reactions proceed. The proposed non-orthologous gene displacements are underlined.

b. Based on the mevalonate pathway in halophiles (Kates, 1993); the gene names are from S. cerevisiae; no candidate for isopentenyl
diphosphate delta-isomerase has been identified.

c. Systems for de novo biosynthesis of all nucleotides and amino acids, except for cysteine, from C1 compounds have been predicted in
M. jannaschii, based on the high similarity to well-characterized bacterial enzymes (Bult et al., 1996; http://www.cme.msu.edu/WIT/). Cysteine
biosynthesis in M. jannaschii appears to occur by a pathway that is different from the common bacterial one, as M. jannaschii lacks homologues
of O-acetylserine (thiol) lyase (cysK or cysM). The likely pathway of cysteine biosynthesis, however, could be predicted based on the reversal
of the cysteine catabolism pathway in eukaryotes (Nagahara et al., 1995).

d. This pathway is modelled after Salmonella typhimurium based on the similarity with Pseudomonas denitrificans, as in Roth et al. (1993), and
both gene symbols are included where appropriate. Orthologues of P. denitrificans genes cobG and cobP are missing in M. jannaschii. Several
candidates with each of the activities required for these reactions (Fe-S oxidoreductase, kinase and GMP transferase) are predicted in the
M. jannaschii genome, which also encodes the orthologues of S. typhimurium genes cbiD (MJ0022), cbiM (MJ1569) and cbiG (MJ1144), genes
with unknown functions implicated in cobalamin biosynthesis.

e. Homologue of B. subtilis gene, bioW, not found in E. coli.



jannaschii genome as a representative of a distinct domain
of life. Given the presumed affinity of archaea with eukaryo- 
tes (Woese et al., 1990), it is, however, unexpected that 
the number of M. jannaschii genes that have orthologues 
in yeast is smaller than the number of genes with ortho- 
logues in Synechocystis sp. (Table 5, and see below). 
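
Orthology is defined above by descent and function, which cannot be read directly from similarity scores. A common operational approximation, shown here only as a sketch, is the reciprocal best hit: two genes are treated as likely orthologues when each is the other's highest-scoring match between the two genomes. This criterion, the function names and the toy scores are assumptions for illustration; the section does not state that this was the procedure used to compile Table 5.

```python
def best_hits(scores):
    """Map each query to its best-scoring subject.

    scores: dict {(query, subject): similarity score}, higher is better.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) in which a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical gene names and scores.
a_vs_b = {("mjA", "hi1"): 250.0, ("mjA", "hi2"): 90.0,
          ("mjB", "hi2"): 120.0, ("mjC", "hi1"): 80.0}
b_vs_a = {("hi1", "mjA"): 245.0, ("hi1", "mjC"): 70.0,
          ("hi2", "mjB"): 118.0, ("hi2", "mjA"): 95.0}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
# [('mjA', 'hi1'), ('mjB', 'hi2')]
```

In this toy case mjC is similar to hi1 but is not its best match, so it would count towards the genes with significant sequence similarity but not towards the orthologues.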

Previous comparisons of the organization of ortho- 
logous genes in bacterial genomes have shown that only 
a few essential operons are conserved over large phylo- 
genetic distances (Mushegian and Koonin, 1996b). The 
conservation of the genome organization is even lower 
in M. jannaschii and is limited mostly to ribosomal protein
operons, genes for two subunits of the DNA-dependent
RNA polymerase and some ion transport operons. In the
comparisons of the M. jannaschii gene organization with 
that of bacteria, the longest common string of genes 
contained only four genes in a row, and the longest univer-
sally conserved blocks consisted of only three genes (data
not shown; URL: http://www.ncbi.nlm.nih.gov/Complete_Genomes/Gene_Strings).
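
The notion of a 'common string of genes' can be made concrete with a short sketch. Each genome is represented as a list of orthologue-group labels in chromosomal order, and the longest run of labels that is contiguous in both lists is reported, reading the second genome in either direction. The representation, the function name and the labels are hypothetical; gene orientation and genome circularity are ignored, and this is not claimed to be the method behind the figures quoted above.

```python
def longest_common_run(order_a, order_b):
    """Longest contiguous run of labels shared by two gene orders.

    order_b is compared in both directions, because a conserved gene
    string may lie on either strand.
    """
    def longest_common_substring(x, y):
        best_len, best_end = 0, 0
        prev = [0] * (len(y) + 1)
        for i in range(1, len(x) + 1):
            cur = [0] * (len(y) + 1)
            for j in range(1, len(y) + 1):
                if x[i - 1] == y[j - 1]:
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > best_len:
                        best_len, best_end = cur[j], i
            prev = cur
        return x[best_end - best_len:best_end]

    forward = longest_common_substring(order_a, order_b)
    reverse = longest_common_substring(order_a, order_b[::-1])
    return forward if len(forward) >= len(reverse) else reverse


# Hypothetical orthologue-group labels in chromosomal order.
genome_a = ["g1", "g2", "g3", "g4", "g9"]
genome_b = ["g7", "g4", "g3", "g2", "g8"]
print(longest_common_run(genome_a, genome_b))  # ['g2', 'g3', 'g4']
```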

Analysis of the representation of different functional 
categories by orthologues indicates, along with analogies, 
major distinctions between M. jannaschii and bacteria. In 
particular, such key components of the archaeal replica- 
tion machinery as the DNA polymerase, ATPases involved 
in replication initiation and ATP-dependent DNA ligase 
Table 5. Sequence similarity and orthologous relationships between
Methanococcus jannaschii genes and genes from other species with
completely sequenced genomes.

Number of M. jannaschii genes (% of the total, 1731)

Species and number          Coding for proteins with             Orthologues
                            significant sequence similarity

H. influenzae (1703)        561 (32.4)                           334 (19.3)
M. genitalium (468)         209 (12.1)                           123 (7.1)
Synechocystis sp. (3168)    676 (39.1)                           454 (26.2)
S. cerevisiae (5885)        544 (31.4)                           331 (19.1)



only have orthologues in eukaryotes. Conversely, M. jan-
naschii encodes no orthologues for the critical proteins of
the bacterial replication apparatus, namely three subunits
of DNA polymerase III (DnaE, DnaX and DnaN), the prin-
cipal replicative helicase (DnaB), ATPase involved in repli-
cation initiation (DnaA), NAD-dependent DNA ligase and
two DNA-binding proteins (Ssb and Dbh).

DNA repair systems and chaperone-like proteins are 
significantly underrepresented in M. jannaschii compared
with both bacteria and yeast (Table 2 and data not 
shown). Major repair systems, e.g. the bacterial UvrABC 
excisionase or the mismatch repair system, which is com-
mon to bacteria and eukaryotes, are missing. M. jannaschii
encodes a number of predicted nucleases, helicases and 
ATPases, some of which are homologues of bacterial 
repair enzymes with known activity but poorly character- 
ized physiological role, e.g. SbcC and SbcD (Table 3), 
but others could only be placed in the 'general prediction' 
category. It appears almost certain that some of these 
predicted enzymes belong to repair systems, but their 
actual roles and modes of interaction await experimental 
studies. 

The absence of the genes for several molecular chaper- 
ones, e.g. the HSP70 (DnaK), HSP90 and HSP40 (DnaJ) 
families, in M. jannaschii is particularly striking as these 
proteins are universally conserved in other genomes and 
appeared to be indispensable for any cell (Mushegian 
and Koonin, 1996a). These molecular chaperones are 
encoded by at least some archaea (Macario et al., 1991;
1993; Gupta and Singh, 1994), including the dnaK gene
in the methanogen Methanosarcina mazei (Macario et
al., 1991). We performed a reverse search of the M. jan-
naschii protein and nucleotide sequences with the sequ-
ences of known molecular chaperones, in order to detect
putative distant homologues. This allowed us to identify 
the most likely candidate for the GroES co-chaperonin, 
which had not been detected in the original report (Bult
et al., 1996) or in our initial analysis of the M. jannaschii 
protein sequences. The detected candidate encoded by 
the MJ0073 gene, even though it shows only limited simi-
larity to GroES, contains most of the amino acid residues
that are typically conserved in the GroES proteins and
aligns well with all the secondary structure elements deter- 
mined from the crystal structure of GroES (Hunt et al., 
1996; Fig. 4). 

However, no other candidate chaperones were revealed 
even by this additional analysis. Therefore, it is likely that
the genes for molecular chaperones have been lost in the 
phylogenetic lineage leading to methanococci. As many 
molecular chaperones possess ATPase activity, one may 
speculate that, in M. jannaschii (and perhaps in other 
methanococci), at least some of the functions of the missing 
chaperones could have been relegated to a unique family 
of putative ATPases (Koonin, 1997, and see below). 

The differences in the repertoire of DNA repair proteins 
and molecular chaperones in M. jannaschii and bacteria 
appear to be manifestations of a general phenomenon, 
which we called 'non-orthologous gene displacement' 
(Koonin et al., 1996c), whereby the same essential func-
tion is performed by non-orthologous (that is, distantly
related or completely unrelated) proteins in different
organisms. In a comparison of M. genitalium and H. influ-
enzae, we found that non-orthologous displacements
involved about 5% of the M. genitalium (the species with
the smaller genome) genes (Koonin et al., 1996c; Mushe-
gian and Koonin, 1996a). In contrast, from the theoretical
minimal gene set derived by comparison of the H. influen-
zae and M. genitalium genomes and consisting of 256
gene products (Mushegian and Koonin, 1996a), only 127
(50%) showed significant sequence similarity to M. jan-
naschii proteins, and 90 (35%) were represented by appar-
ent orthologues. A similar level of conservation was
observed when the bacterial minimal gene set was com- 
pared with the yeast genome (Koonin and Mushegian, 
1996). It appears that there is massive non-orthologous 
displacement of essential genes between bacteria, 
archaea and eukaryotes. 
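
As a quick arithmetic check of the fractions quoted for the minimal gene set, the counts given in the text (256 gene products, 127 with significant similarity, 90 with apparent orthologues) can be converted to percentages. The helper name is an illustrative choice.

```python
def percent(part, total):
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


MINIMAL_GENE_SET = 256   # gene products in the bacterial minimal gene set
WITH_SIMILARITY = 127    # significant similarity to M. jannaschii proteins
WITH_ORTHOLOGUES = 90    # represented by apparent orthologues

print(percent(WITH_SIMILARITY, MINIMAL_GENE_SET))   # 49.6
print(percent(WITH_ORTHOLOGUES, MINIMAL_GENE_SET))  # 35.2
```

These values agree with the rounded figures of 50% and 35% given in the text, and imply that roughly two-thirds of the minimal set has no apparent orthologue in M. jannaschii.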

Families of paralogues in bacteria and archaea: the 
same main classes but significant differences among 
smaller families 

A significant fraction of genes in bacteria, namely about 
one-half in E. coli and about one-third in H. influenzae,
belong to families of paralogues, i.e. genes coding for
homologous proteins with related but not identical func-
tions (Brenner et al., 1995; Koonin et al., 1995; Labedan
and Riley, 1995; Tatusov et al., 1996; reviewed by Saier,
1996). We found that 53% of the M. jannaschii gene pro- 
ducts belong to 194 families of paralogues, a fraction 
somewhat higher than that in H. influenzae, a bacterium 
with almost the same number of genes [the fraction of H. 
influenzae genes included in families of paralogues in 
this study is higher than that in the previous reports 
(Brenner et al., 1995; Tatusov et al., 1996), as we included
Fig. 4. The candidate GroES co-chaperonin encoded by the genome of M. jannaschii. In the database search with the M. jannaschii protein
sequences, the MJ0073 sequence showed a P-value of 0.04 with the GroES sequence from Porphyromonas gingivalis, which was originally
not considered significant. However, under the reciprocal procedure, when the M. jannaschii protein sequences were searched with the GroES
sequences, MJ0073 had the lowest P-value of all M. jannaschii proteins (<10- with the Porphyromonas gingivalis and <10- with several
other GroES sequences). The alignment was constructed using the macaw program. The included sequences are the GroES homologues from
the four completely sequenced genomes, the E. coli GroES protein and the gene 31 product from bacteriophage T4, which possesses a
co-chaperonin activity but is only distantly related to GroES (Koonin and van der Vies, 1995). The consensus shows amino acid residues
conserved in at least five of the six aligned sequences; the designations are as in Fig. 3. The secondary structure elements are from the
crystal structure of the E. coli GroES (Hunt et al., 1996); the dotted lines indicate the two strands that form the mobile loop directly involved in
the interaction of GroES with GroEL. The V in the MJ0073 sequence indicates the position of a frameshift that has been tentatively introduced
in the M. jannaschii nucleotide sequence, resulting in an N-terminal extension of MJ0073 and allowing the inclusion in the alignment of the
segment corresponding to the β1 of GroES.



distant similarities detected by blast2 and motif analysis]
and similar to that in bacteria with larger genomes, e.g.
Synechocystis sp. and E. coli (Table 6; Koonin et al.,
1995, 1996b). Among the families of paralogues in M. jan-
naschii, 18 are unique, whereas the remaining majority is
also represented in other archaea, bacteria and/or eukar- 
yotes. 
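
One simple way to turn pairwise similarities within a genome into families of paralogues is single-linkage clustering: any two proteins connected by a chain of significant similarities are placed in the same family. The sketch below implements this with a union-find structure. The clustering rule, the function names and the toy data are assumptions for illustration; the section does not specify how the 194 families were delineated.

```python
def paralogue_families(genes, similar_pairs):
    """Group genes into families by single-linkage clustering.

    genes: iterable of gene identifiers from one genome.
    similar_pairs: iterable of (gene, gene) pairs judged significantly similar.
    Returns the families with at least two members, each sorted.
    """
    genes = list(genes)
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values() if len(c) >= 2)


def fraction_in_families(genes, families):
    """Fraction of all genes that belong to some family of paralogues."""
    return sum(len(f) for f in families) / len(list(genes))


# Hypothetical genome of six genes.
genes = ["a", "b", "c", "d", "e", "f"]
pairs = [("a", "b"), ("b", "c"), ("d", "e")]
families = paralogue_families(genes, pairs)
print(families)  # [['a', 'b', 'c'], ['d', 'e']]
```

Single linkage tends to merge families through multidomain proteins, which is one reason why family counts depend on the sensitivity of the similarity search, as noted in the text for H. influenzae.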

The largest classes of paralogues are the same in M. 
jannaschii and in bacteria, with the exception of the strik-
ing abundance of Fe-S oxidoreductases among the M.
jannaschii gene products (Table 6). The expansion of 
this enzyme class seems to be linked to the unique bio- 
chemistry of methanococci, as many of the Fe-S oxido- 
reductases are involved in methanogenesis. Interestingly, 
ATPases and GTPases with the 'Walker-type' NTP-binding 
motifs, NAD(FAD)-utilizing enzymes and helix-turn-helix 
DNA-binding proteins comprise nearly identical fractions 
of the gene products in M. jannaschii and H. influenzae 
(Table 6). This is all the more remarkable as only a minor-
ity of the proteins in each of these superfamilies are ortho-
logues; for example, even though M. jannaschii and 
H. influenzae encode nearly the same number of pre- 
dicted ATPases and GTPases with the 'Walker-type' 
ATP-binding motifs (124 and 128 respectively), only 39 
M. jannaschii proteins in this superfamily are represented 
by orthologues in H. influenzae. Furthermore, there are 
families within the largest superfamilies that are unique 
to M. jannaschii. The most striking example is a family 
that includes 16 putative ATPases that we designated 
'MJ-type' ATPases (Table 6; Koonin, 1997) and that 
have originally been described as the largest unique pro- 
tein family in M. jannaschii (Bult et al., 1996). 
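
For orientation, the 'Walker-type' NTP-binding motif A (P-loop) is often written as the pattern [AG]-x(4)-G-K-[ST]. The sketch below scans a protein sequence for this pattern with a regular expression. It is a deliberately crude stand-in: the analysis described here relied on motif-search programs rather than a fixed pattern, a plain pattern match yields both false positives and false negatives, and the example sequence is hypothetical.

```python
import re

# Walker A (P-loop) pattern: [AG]-x(4)-G-K-[ST]
WALKER_A = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(sequence):
    """Return (1-based start position, matched segment) for each match."""
    return [(m.start() + 1, m.group()) for m in WALKER_A.finditer(sequence)]


# Hypothetical sequence fragment containing one P-loop-like segment.
fragment = "MKLAVIGPSGSGKSTLLR"
print(find_p_loops(fragment))  # [(7, 'GPSGSGKS')]
```

Counting proteins with at least one such match gives only a first approximation of a superfamily; distinguishing families within it, such as the 'MJ-type' ATPases, requires comparison of the regions flanking the motif.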

Structural features of bacterial and archaeal proteins: 
M. jannaschii encodes fewer membrane proteins and 
more non-globular proteins than bacteria 

We compared predicted structural features of bacterial, 
archaeal and yeast proteins, namely the number of pro- 
teins containing signal peptides and accordingly predicted 
to be secreted, those that contain predicted transmem- 
brane helices and are likely to be integral membrane pro- 
teins and those that contain coiled-coil domains and other 
non-globular domains. The fraction of predicted trans-
membrane proteins, including those that contain multiple
transmembrane helices and are likely to be transporters, 
is remarkably similar in all three bacteria but is somewhat 
lower in M. jannaschii. Combined with a limited number of 
predicted transport ATPases, this may be a reflection of a 
lower diversity of transport systems, which is compatible 
with the autotrophic lifestyle of methanococci. Compared 
with bacteria, the archaeal proteome is clearly enriched 
in proteins containing coiled-coil and other non-globular 
domains; remarkably, about 7% of the M. jannaschii 
Table 6. Large protein families and superfamilies and their representation in bacterial and archaeal genomes.

Number of proteins (%)

Family/superfamily (a)                               M. jannaschii     H. influenzae     M. genitalium     Synechocystis sp.

All families of paralogues                           918 in 194        703 in 151        165 in 46         1775 in 322
                                                     families (52.8%)  families (41.3%)  families (35.0%)  families (56.0%)
ATPases and GTPases with 'Walker-type'               124 (7.1%)        128 (7.5%)        56 (12.0%)        184 (5.8%)
  NTP-binding motifs
Transport and repair ATPases                         18 (1.0%)         42 (2.5%)         18 (3.8%)         57 (1.8%)
'MJ-type' ATPases                                    19 (1.1%)         0                 0                 0
Superfamily I helicases                              2 (0.1%)          4 (0.2%)          3 (0.6%)          3 (0.1%)
Superfamily II helicases                             14 (0.8%)         12 (0.7%)         3 (0.6%)          10 (0.3%)
'MinD-like' ATPases                                  13 (0.7%)         1 (0.1%)          1 (0.2%)          11 (0.3%)
[...]                                                19 (1.1%)         16 (0.9%)         14 (3.0%)         31 (1.0%)
Fe-S oxidoreductases                                 88 (5.1%)         22 (1.3%)         0                 51 (1.6%)
Helix-turn-helix DNA-binding proteins                44 (2.5%)         49 (2.9%)         3 (0.6%)          55 (1.7%) (b)
  (primarily transcription regulators)
NAD/FAD-utilizing enzymes                            42 (2.4%)         40 (2.3%)         11 (2.3%)         117 (3.7%)
SAM-dependent methyltransferases                     39 (2.2%)         23 (1.4%)         B (1.2%)          46 (1.5%)
NTP-utilizing enzymes with the PP-motif              18 (1%)           7 (0.4%)          4 (0.9%)          7 (0.2%)
  (enzymes with a diagnostic, modified P-loop;
  Bork and Koonin, 1994)
CBS domains (conserved domain with an unknown        16 (0.9%)         5 (0.3%)          1 (0.2%)          5 (0.2%)
  function detected in a variety of proteins,
  including cystathionine beta synthase and
  IMP dehydrogenase)
'MJ-type' nucleotidyltransferases                    12 (0.7%)         1 (0.05%)         0                 3 (0.1%)
  (see text and Fig. 3A)
Zn-dependent hydrolase superfamily I                 15 (0.9%)         3 (0.2%)          1 (0.2%)          9 (0.3%)
  (see text and Fig. 3B)
Zn(Ni)-dependent hydrolase superfamily II            12 (0.7%)         3 (0.2%)          0                 5 (0.2%)
  (including adenine deaminase and dihydroorotase)
Two-component system receiver domains                0                 6 (0.4%)          0                 83 (2.6%)
Two-component system sensor domains                  0                 4 (0.2%)          0                 43 (1.4%)
  (histidine kinases)

a. Ordered by the abundance in M. jannaschii; some of the superfamilies consist of several distinct families.

b. Numerous transposases that also contain the HTH domain are not included.



proteins are predicted not to contain any globular domains 
(Table 7). 

A large fraction of archaeal proteins shows greatest 
similarity to bacterial homologues and a small 
fraction is most similar to eukaryotic homologues 

Archaea are considered to be, along with bacteria and 
eukaryotes, one of the three primary domains of life. 
Phylogenetic analysis of rRNA as well as of several pro- 
teins, primarily those involved in translation and transcrip- 
tion, suggested the grouping of archaea with eukaryotes 
as opposed to bacteria (Woese and Fox, 1977; Woese,
1987; Woese et al., 1990). The root of the universal tree
has been placed between the eukaryotic/archaeal and
the bacterial lineages by phylogenetic analysis of univer-
sally conserved pairs of paralogues (Iwabe et al., 1989;
Gogarten et al., 1989; 1996). It has been observed before
that some archaeal proteins group with bacterial proteins 
and, in particular, with those from Gram-positive bacteria, 
in phylogenetic trees (Gupta and Golding, 1995, 1996; 
reviewed by Gogarten et al., 1996). 

The complete genome analysis, however, provides a 
new perspective on 'eukaryotic' and 'bacterial' genes in 
archaea. Given the presumed archaeal-eukaryotic asso- 
ciation, it is striking that 44% of the M. jannaschii protein 
Table 7. Predicted structural features of bacterial and archaeal proteins.

Predicted feature (a)                        M. jannaschii    H. influenzae    M. genitalium    Synechocystis sp.

[...]                                        [...]            152 (8.9%)       25 (5.3%)        246 (7.6%)
[...]                                        16 (0.9%)        28 (1.6%)        16 (3.4%)        30 (1.0%)
Proteins with transmembrane helices (b)      320 (18.4%)      379 (22.5%)      115 (24.6%)      800 (25.3%)
Proteins with multiple                       153 (8.8%)       226 (13.3%)      52 (11.1%)       363 (11.6%)
  transmembrane helices (c)
Proteins with coiled-coil domains (d)        635 (36.6%)      403 (23.7%)      154 (32.9%)      652 (20.6%)
Non-globular domain-containing               772 (44.4%)      312 (18.3%)      169 (36.1%)      851 (26.9%)
  proteins (e)
Proteins without globular domains            124 (7.2%)       32 (1.9%)        18 (3.9%)        63 (2.0%)



a. The predictions were made in the above order (from top to bottom); residues for which any feature was predicted were masked, i.e. removed 
from consideration for further predictions. 

b. This class included proteins that contained at least one predicted transmembrane helix other than a signal peptide. 

c. Proteins with at least four predicted transmembrane helices.

d. Predicted coiled-coil domains of less than 21 residues were not included.

e. Predicted non-globular domains of less than 20 residues were disregarded. Regions of 30 or fewer residues bounded by non-globular domains
or by a non-globular domain and the sequence terminus were merged into the adjacent non-globular domain(s). Accordingly, proteins containing 
no regions longer than 30 residues between non-globular domains were considered to contain no globular domains. 
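
The merging rule of footnote e can be written out as a short sketch. Non-globular predictions shorter than 20 residues are dropped; every remaining stretch that is bounded by non-globular domains, or by a non-globular domain and a sequence terminus, counts as globular only if it is longer than 30 residues. The thresholds are those given in the footnote; the function names, the coordinate convention (1-based, inclusive) and the example coordinates are assumptions for illustration.

```python
def globular_regions(length, nonglobular, min_nonglobular=20, max_merged=30):
    """Regions of a protein treated as globular after merging.

    length: protein length in residues.
    nonglobular: list of (start, end) predicted non-globular segments,
                 1-based and inclusive.
    """
    # Disregard predicted non-globular domains of fewer than 20 residues.
    segments = sorted((s, e) for s, e in nonglobular
                      if e - s + 1 >= min_nonglobular)
    regions, position = [], 1
    for start, end in segments:
        if start > position:
            regions.append((position, start - 1))
        position = max(position, end + 1)
    if position <= length:
        regions.append((position, length))
    if not segments:
        return regions  # no non-globular domain: nothing to merge
    # Regions of 30 or fewer residues are merged into the adjacent
    # non-globular domain(s), so only longer regions remain.
    return [(s, e) for s, e in regions if e - s + 1 > max_merged]


def has_no_globular_domain(length, nonglobular):
    """True if no globular region remains after merging."""
    return not globular_regions(length, nonglobular)


# Hypothetical 200-residue protein: the third prediction (11 residues)
# is disregarded, and the 20-residue region 51-70 is merged.
print(globular_regions(200, [(1, 50), (71, 120), (190, 200)]))  # [(121, 200)]
print(has_no_globular_domain(100, [(1, 40), (61, 100)]))        # True
```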



sequences show a significantly higher similarity to their 
bacterial homologues than to eukaryotic homologues as 
opposed to the mere 13% that are more similar to eukaryo- 
tic homologues (Fig. 5A; the remaining 43% of the proteins 
either show approximately the same level of similarity to 
the bacterial and eukaryotic homologues, have only archaeal
homologues or have no homologues at all). The number of 
M. jannaschii proteins that have detectable homologues 
only among bacterial proteins is also considerably greater 
than the number of proteins that are similar only to eukar- 
yotic proteins (Fig. 1). Importantly, a qualitatively similar 
breakdown of the proteins into those with the greater simi- 
larity to bacterial homologues and those most similar to 
eukaryotic homologues was detected among the 275 dis- 
tinct protein sequences from the archaeal genus, Sulfolo- 
bus, currently available in the databases (Fig. 5A). As 
Methanococcus and Sulfolobus belong to the two principal
phylogenetic divisions of the archaea, Euryarchaeota and 
Crenarchaeota, respectively (Pace, 1997), it appears 
likely that the observed quantitative prevalence of 'bacter- 
ial' proteins is typical of all archaea. 
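
The breakdown described above amounts to sorting each archaeal protein by how its best bacterial hit compares with its best eukaryotic hit. The sketch below shows one way such a tally could be made. The criterion for 'significantly higher similarity' is not given in this section, so the score ratio used here (margin = 1.2), the category names and the toy scores are purely hypothetical.

```python
from collections import Counter


def classify(best_bacterial, best_eukaryotic, margin=1.2):
    """Classify a protein by its best similarity scores (None = no hit).

    margin is a hypothetical ratio by which one score must exceed the
    other for the similarity to count as clearly higher.
    """
    if best_bacterial is None and best_eukaryotic is None:
        return "no bacterial or eukaryotic homologue"
    if best_eukaryotic is None:
        return "bacterial only"
    if best_bacterial is None:
        return "eukaryotic only"
    if best_bacterial >= margin * best_eukaryotic:
        return "more similar to bacterial"
    if best_eukaryotic >= margin * best_bacterial:
        return "more similar to eukaryotic"
    return "about equally similar"


# Hypothetical best scores: (bacterial, eukaryotic).
hits = {
    "p1": (300.0, 120.0),
    "p2": (90.0, 210.0),
    "p3": (150.0, 140.0),
    "p4": (80.0, None),
    "p5": (None, None),
}
tally = Counter(classify(b, e) for b, e in hits.values())
print(tally["more similar to bacterial"])  # 1
```

Whatever threshold is chosen, the 'about equally similar' class should be kept separate, since in the text it contributes to the remaining 43% rather than to either the 44% or the 13%.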

Beyond this general balance, there is a strong contrast 
between different functional classes of proteins (Fig. 5A). 
Translation and transcription are predominantly eukaryo- 
tic, although bacterial-type transcription regulators con- 
taining the helix-turn-helix DNA-binding domain are 
abundant (Table 6). The replication, recombination and
repair class is quantitatively dominated by 'bacterial' pro- 
teins, but these are mostly accessory proteins, such as 
endonucleases and DNA methylases. In contrast, the key 
components of the replication machinery only have ortho- 
logues in eukaryotes (see above). The protein secretion 
apparatus appears to be hybrid, consisting of homologues 
of both bacterial and eukaryotic secretion proteins. All the
other functional classes, including the large group of pro- 
teins for which only a general functional prediction was 
possible, are dominated by 'bacterial' proteins (Fig. 5A). 

Thus, the archaeal gene complement consists of a 
majority of genes most similar to their bacterial homo-
logues and coding primarily for metabolic enzymes, trans-
port systems and enzymes of cell wall biogenesis, and
a minority of genes with the closest similarity to their 
eukaryotic counterparts, which typically encode proteins 
involved in genome expression. 

Detailed discussion of the evolution of the three 
domains of life is beyond the scope of this work. Much 
more data and careful analysis are required for a conclusive 
picture to emerge, and here we only briefly discuss differ- 
ent scenarios that may account for the mosaic compo- 
sition of the archaeal gene sets. Two fundamentally 
different types of explanation seem possible: (i) major var- 
iations in evolutionary rates for different groups of genes in 
different lineages; and (ii) genome fusion and/or horizontal 
gene transfer accompanied by gene loss. A rate variation 
scenario would posit that those groups of genes that are 
conserved in archaea and bacteria (e.g. the majority of 
metabolic enzymes) underwent a dramatic increase in 
the evolutionary rate in the eukaryotic lineage shortly 
after its separation from the other two lineages. Con- 
versely, the group of genes that appear eukaryotic in 
archaea (mainly translation, transcription and replication 
components) should have had a phase of rapid change 
in the early evolution of bacteria. These hypothetical 
epochs of rapid change should have been brief on the 
evolutionary scale, as both groups of proteins are highly 
conserved even among deeply branching lineages within 
the bacterial and the eukaryotic domain. While technically 
possible, such complementary evolutionary explosions 
appear unlikely. 

Thus, a scenario including genome fusion and/or 
horizontal gene transfer, accompanied by gene elimina- 
tion, seems to be the most realistic explanation of the 
observed distribution of sequence similarities among 
archaeal protein sequences. Strong evidence for horizon- 
tal gene transfer may come from specific relationships 
between archaeal genes and genes from a particular 
bacterial lineage, e.g. Gram-positive bacteria (Gupta and 
Golding, 1995; 1996). However, our analysis of the com- 
plete set of M. jannaschii protein sequences indicates 
that most of them show approximately the same level 
of similarity to homologues from Gram-positive and 
Gram-negative bacteria (Fig. 5B). Among those archaeal 
proteins that do show significantly higher similarity to 
homologues from a particular bacterial lineage, affinity 
with Gram-negative bacteria is more frequent (Fig. 5B 
and C). These preliminary observations may be most 
consistent with a merger between an ancestral bacterium, 
antedating the radiation of the main bacterial lineages, and 
an ancestral cell from the lineage that gave rise to the 
eukaryotic nucleocytoplasm, followed by differential gene 
loss. The subsequent evolution of archaea might have 
included multiple additional events of horizontal transfer 
of bacterial genes; hence, genes specifically related to 
homologues from Gram-negative bacteria, Gram-positive 
bacteria or cyanobacteria (Fig. 5C). This view of the probable 
evolutionary history of archaea invokes an obvious 
analogy with eukaryotes, which acquired a great number 
of bacterial genes as a consequence of mitochondrial 
and chloroplast endosymbiosis. 

Fig. 5. Bacterial and eukaryotic homologues of archaeal proteins. 

A. Proteins with highest similarity to bacterial and eukaryotic 
homologues in M. jannaschii and Sulfolobus. 

The M. jannaschii proteins were classified by functional category 
as in Table 2 (RHR, replication, recombination and repair; aa, 
amino acid metabolism and transport; membr, membrane 
biogenesis; chaps, molecular chaperones and proteins with related 
functions; general, proteins for which only general functional 
prediction was possible). For each of the archaeal proteins, the 
best bacterial and the best eukaryotic hit were detected using the 
blatax program, after which these hits were examined individually. 
A protein was considered to be significantly more similar to its 
bacterial homologue than to the eukaryotic homologue, or vice 
versa, if the difference in the percentage identity in the best 
alignments reported by the wublastp program was at least five 
points and/or the difference in the reported P-values was at least 
several orders of magnitude; additionally, the conservation of 
domain organization was taken into account when assigning a 
protein to one of the classes. 

B. Scatter-plot of the similarity scores between M. jannaschii 
proteins and their homologues from Gram-negative and 
Gram-positive bacteria. The axes show the similarity scores reported by 
the wublastp program. The data for M. jannaschii proteins that had 
a score of at least 90 with proteins from each of the bacterial 
lineages are included. 

C. Classification of the 'bacterial' proteins from M. jannaschii by 
sequence similarity to homologues from three major bacterial 
lineages. The sequences of the 745 M. jannaschii proteins 
classified as 'bacterial' were further analysed using the same 
criteria as in A. 

© 1997 Blackwell Science Ltd, Molecular Microbiology, 25, 619-637 

Conclusions 

Regardless of the phylogenetic position of the organism, 
sequence similarity to proteins from other species could 
be detected and a function could be predicted, at least at 
a general level, for a large majority of the gene products. 
The fraction of proteins containing regions conserved 
over long phylogenetic distances is approximately the 
same in bacteria and archaea and is close to 70%. 
Thus, the application of sensitive methods and detailed 
analysis of conserved motifs makes the archaeal genomes 
as amenable to meaningful interpretation by computer 
as bacterial genomes. 

The archaeal genomes encode many more proteins with 
a significantly higher similarity to bacterial homologues 
than proteins for which the closest homologue is eukaryotic. 
This mosaic composition of the archaeal gene set, 
together with the relatively small fraction of M. jannaschii 
proteins that have orthologues among the genes in each 
of the other completely sequenced genomes, is compatible 
with the notion of archaea as a distinct domain of 
life. In a similar fashion to that by which eukaryotes 
acquired a number of bacterial genes as a consequence 
of mitochondrial and chloroplast endosymbiosis, the evolution 
of archaea probably included at least one major merger 
between ancestral cells from the bacterial lineage and 
the lineage leading to the eukaryotic nucleocytoplasm. 

Experimental procedures 

Nucleotide and protein sequences and databases 

The nucleotide sequence of the H. influenzae genome was 
from Fleischmann et al. (1995), the M. genitalium sequence 
was from Fraser et al. (1995), the M. jannaschii sequence 
was from Bult et al. (1996) and the Synechocystis sp. sequence 
was from Kaneko et al. (1996). The gene complements 
of H. influenzae and M. genitalium were re-evaluated as 
described previously (Tatusov et al., 1996; Mushegian and 
Koonin, 1996a), resulting in 1703 and 468 protein-coding 
genes respectively. There was no attempt to reassess the 
gene identification in M. jannaschii systematically, but the 
originally reported intergenic regions (Bult et al., 1996) were 
compared with protein sequence databases using the blastx 
program (see below), resulting in the identification of five 
previously undetected genes. In addition, two groups of three 
genes and six pairs of genes from the originally described 
gene set (Bult et al., 1996) were each identified as originating 
from a single gene disrupted by frameshifts. The resulting set 
of M. jannaschii genes used for the present analysis thus 
consisted of 1731 genes. The gene complement of Synechocystis 
sp., which includes 3168 genes, was used as described 
originally (Kaneko et al., 1996). 

All database screening was against the protein and nucleo- 
tide versions of the non-redundant (NR) sequence database 
maintained at the National Center for Biotechnology Informa- 
tion (NIH, Bethesda, MD, USA). 

The information on biochemical pathways was, in part, from 
the WIT database (http://www.cme.msu.edu/WIT/), the Boeh- 
ringer Mannheim metabolic map (http://expasy.hcuge.ch/cgi- 
bin/search-biochem-index) and the Kyoto Encyclopedia of 
Genes and Genomes (http://www.genome.ad.jp/kegg/ 
kegg2.html). 




Database searches and protein sequence comparisons 

Searches of the protein version of the NR database were performed 
using both the blastp program (Altschul et al., 1990) 
and the wublastp program based on the blast2 algorithm 
(Altschul and Gish, 1996). In the blast2 algorithm, the extreme 
value distribution statistics for a single local alignment and the 
sum statistics for multiple compatible alignments, which were 
originally used for blast, are generalized to include 
gapped alignments, resulting in a significantly higher search 
sensitivity (Altschul and Gish, 1996; E. V. Koonin and A. R. 
Mushegian, unpublished observations). Therefore, the results 
of the wublastp searches were generally used as the basis for 
assessing sequence similarity. Low-complexity regions in 
protein sequences, which frequently produce spurious hits 
in database searches, were masked before the database 
search using the seg program (Wootton and Federhen, 
1996). Under these conditions, a P-value of 0.001 or below 
produced by the wublastp program was considered a strong 
indication of homology. The relevance of the search results 
was evaluated essentially as described previously (Koonin 
et al., 1996b), with particular attention given to those that 
were associated with P-values greater than 0.001. The 
consistency of alignments between the query sequence and 
different database sequences was assessed using the cap 
program (Tatusov et al., 1994), multiple alignment analysis 
and visual inspection. The conservation of patterns from the 
PROSITE database (Bairoch, 1996) in blast outputs was 
determined using the bla program (Tatusov and Koonin, 
1994). The ECMOT collection produced in the course of the 
analysis of E. coli protein sequences (Koonin et al., 1995) 
was used as an additional source of protein motifs. Given the 
non-transitivity of database searches, in cases of a small 
number of matches in the database or matches to sequences 
of uncharacterized proteins only, additional iterations of 
database screening using the wublastp program were performed 
(Koonin and Tatusov, 1994). New protein motifs and multiple 
alignments derived from database searches were used to 
screen the NR database with the programs most (Tatusov et 
al., 1994) and HMMer (Eddy et al., 1995) respectively. An 
alignment of a query sequence with a database sequence was 
considered relevant if it had an associated P-value of less 
than 0.001 and/or contained a known or new unique motif(s). 
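
The acceptance rule stated in the last sentence can be written as a small predicate. What follows is a minimal sketch and not the authors' code: the `Hit` record and its field names are hypothetical, and the helper that converts an expectation value E into a P-value assumes the usual Poisson relation P = 1 - exp(-E), which is not stated in the text.

```python
import math
from dataclasses import dataclass

P_CUTOFF = 0.001  # threshold stated in the text


@dataclass
class Hit:
    """Hypothetical record for one query/database alignment."""
    subject: str
    p_value: float
    has_motif: bool = False  # a known or new unique motif was found


def p_from_expect(e_value: float) -> float:
    """ASSUMPTION: convert an expectation value E to P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def is_relevant(hit: Hit) -> bool:
    """Relevant if the P-value is below the cut-off and/or a motif is present."""
    return hit.p_value < P_CUTOFF or hit.has_motif


def needs_close_inspection(hit: Hit) -> bool:
    """Hits above the cut-off are the ones given particular attention."""
    return hit.p_value > P_CUTOFF
```

A weak hit such as `Hit("x", 0.02, has_motif=True)` passes only because of the motif, which mirrors the "and/or" in the stated criterion.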

Screening of the protein sequence database with nucleo- 
tide sequences translated in six frames in order to detect pre- 
viously unidentified genes was performed using the blastx 
program (Altschul et al., 1990). Conversely, nucleotide sequence 
databases translated in six frames were screened with 
protein sequences using the tblastn program (Altschul et 
al., 1990). 
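
Translation in six frames, as used by blastx and tblastn, means reading the sequence in the three forward offsets and the three offsets of its reverse complement. The sketch below illustrates only that step, assuming the standard genetic code; the function names are illustrative and the code is not taken from either program.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in T, C, A, G order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate one frame; a trailing partial codon is dropped and
    codons with ambiguous bases become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Return the six conceptual translations keyed '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Each of the six strings can then be compared with protein sequences, with `*` marking stop codons.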

Multiple alignments of protein sequences were constructed 
using the macaw program (Schuler et al., 1991). 

The results of database searches were classified according 
to their taxonomic origin using the blatax program (Koonin et 
al., 1996b). 
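
The legend to Fig. 5 describes how the best bacterial and the best eukaryotic hit were compared after taxonomic classification. The sketch below shows one way such a rule could be coded; it is not the blatax program. The `TaxHit` record is hypothetical, "several orders of magnitude" is read as three orders purely as an assumption, and the manual inspection and domain-organization check described in the legend are not modelled.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional

IDENTITY_MARGIN = 5.0  # "at least five points" (Fig. 5 legend)
P_ORDERS = 3.0         # ASSUMPTION: one reading of "several orders of magnitude"


@dataclass
class TaxHit:
    """Hypothetical record: one database hit tagged with its domain."""
    domain: str       # "bacteria" or "eukaryota"
    identity: float   # percentage identity in the best alignment
    p_value: float    # P-value reported for the alignment
    score: float      # similarity score


def best_hit(hits: Iterable[TaxHit], domain: str) -> Optional[TaxHit]:
    """Highest-scoring hit from the given domain, or None."""
    candidates = [h for h in hits if h.domain == domain]
    return max(candidates, key=lambda h: h.score, default=None)


def classify(hits: Iterable[TaxHit]) -> str:
    hits = list(hits)
    bac, euk = best_hit(hits, "bacteria"), best_hit(hits, "eukaryota")
    if bac is None and euk is None:
        return "no homologue"
    if euk is None:
        return "bacterial"
    if bac is None:
        return "eukaryotic"
    id_diff = bac.identity - euk.identity
    # Positive when the bacterial hit has the smaller (better) P-value.
    p_diff = (math.log10(max(euk.p_value, 1e-300))
              - math.log10(max(bac.p_value, 1e-300)))
    bac_vote = id_diff >= IDENTITY_MARGIN or p_diff >= P_ORDERS
    euk_vote = id_diff <= -IDENTITY_MARGIN or p_diff <= -P_ORDERS
    if bac_vote and euk_vote:
        return "ambiguous"  # conflicting criteria: left for manual inspection
    if bac_vote:
        return "bacterial"
    if euk_vote:
        return "eukaryotic"
    return "no clear preference"
```

Returning "ambiguous" when the two criteria disagree is a design choice of this sketch; the text indicates that such cases were examined individually.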



Identification of orthologues and paralogues 

A database consisting of the sequences of all gene products 
from M. jannaschii, H. influenzae, M. genitalium, Synecho- 
cystis sp. and S. cerevisiae was compared with itself using 
the wublastp program, and consistent groups of potential 
orthologues were delineated using the rog (Rows of Ortholog 
Groups) program (R. L. Tatusov, E. V. Koonin and D. J. Lip- 
man, in preparation). The putative orthologous relationships 
were further examined case by case in order to verify the 
statistical significance of the sequence similarity and the con- 
servation of the domain organization between the candidate 
orthologues (Tatusov et al., 1996). 
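
The rog program itself is not described here, so its procedure cannot be reproduced. As an illustration of the general idea of finding candidate orthologues from an all-against-all comparison, the sketch below uses reciprocal best hits between genomes, a widely used first approximation that is substituted here as an assumption. The input layout (a dictionary of pairwise scores and a protein-to-genome map) is hypothetical.

```python
def best_hits(scores: dict, genome_of: dict) -> dict:
    """For each protein, find its top-scoring hit in every other genome.

    scores maps (query, subject) -> similarity score;
    genome_of maps protein identifier -> genome name.
    Returns a dict mapping (query, other_genome) -> best subject.
    """
    best = {}
    for (query, subject), score in scores.items():
        if genome_of[query] == genome_of[subject]:
            continue  # within-genome hits are paralogue candidates, not orthologues
        key = (query, genome_of[subject])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: subject for key, (subject, _) in best.items()}


def reciprocal_best_hits(scores: dict, genome_of: dict) -> set:
    """Pairs of proteins from different genomes that are each other's best hit."""
    best = best_hits(scores, genome_of)
    pairs = set()
    for (query, _genome), subject in best.items():
        if best.get((subject, genome_of[query])) == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs
```

Pairs found this way are only candidates; the text states that each putative orthologous relationship was then checked case by case for statistical significance and conserved domain organization.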

Conserved strings of genes in different genomes (with pos- 
sible gaps) were detected using the genestring program 
(Tatusov et al., 1996) and the produced lists of orthologues. 

In order to identify clusters (families) of paralogues in each 
of the species, single-linkage clustering of protein sequences 
was initially performed using the clus program (Koonin et al., 
1996b), based on the results of blastp searches. A blastp score 
of 70 was chosen as the cut-off for clustering, with low-com- 
plexity regions in protein sequences masked using the seg 
program. The resulting protein families were further expanded 
based on the results of wublastp comparisons and motif ana- 
lysis. The consistency of the alignment in each of the families 
was verified using the cap program, multiple alignment analysis 
and/or visual inspection of the wublastp search results. 
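
Single-linkage clustering with a score cut-off places two proteins in the same cluster whenever a chain of pairwise hits at or above the cut-off connects them. The sketch below illustrates this with a union-find structure; it is not the clus program, the input layout is hypothetical, and treating a score of exactly 70 as passing is an assumption.

```python
from collections import defaultdict

SCORE_CUTOFF = 70  # blastp score cut-off stated in the text


def single_linkage(proteins, pair_scores, cutoff=SCORE_CUTOFF):
    """Cluster proteins so that any pair scoring >= cutoff shares a cluster.

    pair_scores maps (protein_a, protein_b) -> similarity score.
    Returns clusters as sorted lists, largest first.
    """
    parent = {p: p for p in proteins}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= cutoff:  # ASSUMPTION: the cut-off is inclusive
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for p in proteins:
        clusters[find(p)].append(p)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


def paralogue_families(proteins, pair_scores, cutoff=SCORE_CUTOFF):
    """Keep only clusters with at least two members."""
    return [c for c in single_linkage(proteins, pair_scores, cutoff)
            if len(c) >= 2]
```

Because linkage is transitive, two proteins with a weak direct score can still be grouped through an intermediate member, which is why the text follows clustering with a consistency check of the alignments in each family.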



Analysis of structural features of proteins 

Signal peptides were predicted using the signalp program 
(Nielsen et al., 1997). Predicted signal peptides containing 
more than 35 amino acid residues were ignored. Lipoproteins 
were identified using the gref program (Walker and Koonin, 
1997), according to the criteria described by Sankaran et 
al. (1995). Transmembrane helices were predicted using the 
PHDtopology program (Rost et al., 1995; Rost, 1996). Predicted 
helices shorter than 17 residues were disregarded. 
Coiled-coil regions were identified using the coils2 program 
(Lupas, 1996). Non-globular domains were predicted using 
the seg program with the parameters 45 (window length), 3.4 
(trigger complexity) and 3.75 (extension complexity) (Woot- 
ton and Federhen, 1996). All the structural features were pre- 
dicted in batch mode for complete sets of proteins from each 
species, and the results were automatically integrated using 
the UniPred program (Walker and Koonin, 1997). 
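
The two length filters given above (signal peptides longer than 35 residues ignored, helices shorter than 17 residues disregarded) can be applied when predictions from separate programs are merged per protein. The sketch below illustrates only that filtering and merging step; it is not the UniPred program, and the `Segment` record and the feature labels are hypothetical.

```python
from dataclasses import dataclass

MAX_SIGNAL_PEPTIDE = 35  # longer predicted signal peptides were ignored
MIN_TM_HELIX = 17        # shorter predicted helices were disregarded


@dataclass
class Segment:
    """Hypothetical record for one predicted feature on a protein."""
    kind: str   # e.g. "signal_peptide", "tm_helix", "coiled_coil"
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def keep(segment: Segment) -> bool:
    """Apply the length thresholds stated in the text."""
    if segment.kind == "signal_peptide":
        return segment.length <= MAX_SIGNAL_PEPTIDE
    if segment.kind == "tm_helix":
        return segment.length >= MIN_TM_HELIX
    return True


def integrate(predictions: dict) -> dict:
    """Filter each protein's segments and order them along the sequence."""
    return {
        protein: sorted((s for s in segments if keep(s)),
                        key=lambda s: s.start)
        for protein, segments in predictions.items()
    }
```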



Prediction of protein function 

Protein functions were inferred by detailed inspection of the 
results of database searches, motif analysis, multiple align- 
ments and structural predictions. For each protein, an attempt 
was made to predict the function or activity at the appropriate 
level of precision in order to avoid both overprediction and 
omission of relevant information. The database annotation 
attached to the protein with the highest similarity to the given 
query was not automatically considered applicable, even if 
the similarity was highly statistically significant (cf. Bork and 
Bairoch, 1996), and a conservative approach was adopted 
in general. For example, for transport proteins, the substrate 
was predicted only in cases when the similarity to a permease 
or transport ATPase with a known specificity was much higher 
than the similarity to proteins from other transport systems. 
Analogously, for such widespread enzymes with diagnostic 
conserved motifs as S-adenosyl methionine (SAM)-dependent 
methyltransferases or different classes of hydrolases, 
the specificity was predicted only in cases when a clear 
orthologue with a known specificity was available. The consistency 
of functional predictions for orthologues and members of 
families of paralogues was ensured. 

In cases when a protein responsible for a particular function 
could not be identified in a genome on the basis of the initial 
sequence analysis, a reverse search procedure was applied, 
whereby a set of sequences of proteins with the respective 
function was compared with all protein sequences encoded 
in the given genome as well as the complete nucleotide 
sequence translated in six frames. 



Availability of the results 

The detailed results of the computer analysis of complete bac- 
terial and archaeal genomes are available through the World 
Wide Web (URL: http://www.ncbi.nlm.nih.gov/Complete_Genomes). 
Note added in proof 

After this manuscript was submitted, we became aware of 
a publication that reports the detection of homologues for 
214 additional (compared with the original report) proteins 
from M. jannaschii, increasing the fraction of proteins with 
recognized similarities to 54% [Kyrpides, N.C., Olsen, G.J., 
Klenk, H.-P., White, O., and Woese, C.R. (1996) Methanococcus 
jannaschii genome: revisited. Microbial Comparative 
Genomics 1: 329-338]. 